NLP Page - Executive Summary
Elon Musk, the CEO of Tesla, has been in the spotlight because his decisions, including the purchase of Twitter, are often viewed as spontaneous and reckless. Despite Twitter’s attempt at a poison pill defense, Musk successfully acquired Twitter and has begun restructuring it. Yet Musk’s reputation has long been controversial. By exploring a large dataset of comments and submissions from Reddit, an American social news aggregation, content rating, and discussion website, we can examine the rating, reputation, and relations of Elon Musk from 2021 to the end of August 2022.
Fig. 1a & 1b - Positive Word Cloud on Elon Musk, Negative Word Cloud on Elon Musk
First, after cleaning the text with Natural Language Processing (NLP) techniques, we can see the abundance and importance of the keywords linked to Elon Musk. We can also separate positive and negative words and create side-by-side word clouds, which show that common words such as “Twitter” and “make” appear in both the positive and the negative cloud. This illustrates that not only are Elon Musk’s comments and reputation controversial, but the topics linked to him are also controversial in their context.
Fig. 2 - Bar Graph Illustrating Reddit Sentiment Distribution towards Elon Musk from 2021/01/01 to 2022/08/31
Second, as seen in the figure above, the ratios between positive and negative Reddit sentiment are similar across three flairs: “Elon”, “General” and “Tweets”. This may be related to the fact that most Reddit users treat the platform as a place to voice their inner thoughts. We initially expected the gap between positive and negative sentiment to be larger for the flair “Elon”, as Elon Musk has a controversial reputation in the news and on social media. Yet the ratio for “Elon” is quite similar to the ratios for “General” and “Tweets”. This shows that although submissions flaired “Elon” carry more negative than positive sentiment, they are “no more negative” than those under the two other flairs. In addition, the sentiment ratio of the flair “Twitter” is proportionally much higher than that of “Elon”. This is both interesting and odd, as most newspapers and other social media sources have framed the acquisition as Elon Musk’s ambition, with Twitter as the “unfortunate” company trying to escape the lion’s claw. Do Reddit users genuinely think badly of Twitter as well?
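The per-flair comparison above comes down to one ratio per flair. The sketch below shows one way to compute it, assuming the ratio is negative submissions divided by positive submissions (the report does not state the direction). The rows and the function name `negative_to_positive_ratio` are invented for illustration and are not the project's data or code.

```python
from collections import Counter

# Hypothetical (flair, sentiment) pairs, invented for illustration only.
rows = [
    ("Elon", "positive"), ("Elon", "negative"), ("Elon", "negative"),
    ("General", "positive"), ("General", "negative"), ("General", "negative"),
    ("Twitter", "positive"), ("Twitter", "negative"),
    ("Twitter", "negative"), ("Twitter", "negative"),
]


def negative_to_positive_ratio(rows):
    """Return {flair: negative count / positive count}."""
    counts = Counter(rows)
    ratios = {}
    for flair in sorted({flair for flair, _ in rows}):
        positive = counts[(flair, "positive")]
        negative = counts[(flair, "negative")]
        # Guard against flairs that have no positive submissions at all.
        ratios[flair] = negative / positive if positive else float("inf")
    return ratios


ratios = negative_to_positive_ratio(rows)
for flair, ratio in ratios.items():
    print(f"{flair}: {ratio:.2f} negative per positive")
```

With these made-up rows, “Elon” and “General” share the same ratio while “Twitter” is higher, which mirrors the pattern described in the text.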
NLP Page - Data Analysis Report
Through NLP, we are able to address business goals 1, 9 and 10 comprehensively, while the other business goals are only lightly touched upon because our general business goals changed. The NLP portion focuses primarily on the reputation, comment quality and sentiment of tech companies and their chief executive officers. We chose Elon Musk as a starting point and assess his influence by reviewing the sentiment of post texts that involve him. We start by filtering submissions under the subreddit ‘elonmusk’. After viewing the top 15 link flairs (‘link_flair_text’) by number of submissions, we subset the dataframe to the link flairs “Elon”, “General”, “Tweets”, “null”, and “Twitter” for further exploration.
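The filtering step described above can be sketched in plain Python as follows. The project itself works on a Spark dataframe, so this is only an illustration of the logic. The records, the helper names `top_flairs` and `subset_by_flair`, and the use of `None` to stand for the “null” flair are assumptions; only the column name `link_flair_text` and the flair names come from the text.

```python
from collections import Counter

# Hypothetical records; only the column name 'link_flair_text' comes from the text.
submissions = [
    {"subreddit": "elonmusk", "link_flair_text": "Elon"},
    {"subreddit": "elonmusk", "link_flair_text": "Elon"},
    {"subreddit": "elonmusk", "link_flair_text": "Tweets"},
    {"subreddit": "elonmusk", "link_flair_text": None},  # assumed stand-in for "null"
    {"subreddit": "elonmusk", "link_flair_text": "Meme"},
    {"subreddit": "teslamotors", "link_flair_text": "General"},
]

# Flairs kept for the analysis, as listed in the report.
KEEP = {"Elon", "General", "Tweets", None, "Twitter"}


def top_flairs(records, subreddit, n=15):
    """Rank link flairs in one subreddit by number of submissions."""
    counts = Counter(
        r["link_flair_text"] for r in records if r["subreddit"] == subreddit
    )
    return counts.most_common(n)


def subset_by_flair(records, subreddit, keep):
    """Keep only submissions from the subreddit whose flair is in `keep`."""
    return [
        r for r in records
        if r["subreddit"] == subreddit and r["link_flair_text"] in keep
    ]


ranked = top_flairs(submissions, "elonmusk")
subset = subset_by_flair(submissions, "elonmusk", KEEP)
print(ranked[0], len(subset))
```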
The above is a time series graph that illustrates the number of submissions on Elon Musk from January 2021 to August 2022 for the link flairs selected above. Submissions with the flair “Elon” are especially high on Jan 8, 2021 (Elon Musk’s comment on Twitter’s ban on Trump after the Capitol attack), May 9, 2021 (how Elon Musk would reverse Twitter’s ban on Trump), Dec 31, 2021 and Apr 14, 2022 (acquisition of Twitter by Elon Musk). In addition, the flair “Twitter” barely appears until Apr 12, 2022, two days before the announcement of Elon’s acquisition of Twitter. Up to that point, the flair “Tweets” may mostly indicate Elon’s interesting tweets, as Elon is also fond of using tweets to voice his thoughts to the world. This graph touches upon business goals 5 and 6, which ask which flairs are popular at which time. Flair trends can indicate important social events and controversies in the community. Significant dates such as those above correspond both to peaks on the graph and to the publication dates of major news stories.
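The peaks and the first appearance of a flair can be read off a per-day count. The sketch below shows the idea on a handful of invented posts. The dates are borrowed from the text, but the counts and the helper names `daily_counts`, `peak_days` and `first_seen` are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical (day, flair) pairs; the counts are invented for illustration.
posts = [
    (date(2022, 4, 12), "Twitter"),
    (date(2022, 4, 14), "Elon"),
    (date(2022, 4, 14), "Elon"),
    (date(2022, 4, 14), "Twitter"),
    (date(2022, 4, 15), "Elon"),
]


def daily_counts(posts, flair):
    """Count submissions per day for one flair."""
    return Counter(day for day, f in posts if f == flair)


def peak_days(counts, k=3):
    """Return the k busiest days, breaking ties by the earlier date."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]


def first_seen(posts, flair):
    """Return the first day a flair appears, or None if it never does."""
    days = [day for day, f in posts if f == flair]
    return min(days) if days else None


elon_peaks = peak_days(daily_counts(posts, "Elon"), k=1)
twitter_start = first_seen(posts, "Twitter")
print(elon_peaks, twitter_start)
```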
Fig. 4 - Graph Showing Text Length Distribution of Submissions in ‘elonmusk’ Subreddit from 2021/01/01 to 2022/08/31
The dataframe we are working with has 10,668 rows, a suitable size for NLP processing. As a basic text data check, we counted the most common words in the dataset; before word processing, these are mostly stop words. The word “Elon” has a count of 339, ranking around 10th overall. As seen in the figure above, most submission texts are between 1 and 100 words long. Although the character limit on Reddit is 10,000 characters, approximately 1,400-2,000 words, most Reddit users write only a sentence or a short paragraph to voice their thoughts on Elon Musk. This graph relates to business goals 6 and 10, as text length may vary with popularity. In this case, most texts are 1-3 sentences long.
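The basic checks above (word counts per submission, most common words, and the character-to-word conversion) can be sketched as follows. The sample texts and helper names are invented. The figure of 5 to 7 characters per word, spaces included, is an assumption that happens to reproduce the 1,400-2,000 word range quoted in the text.

```python
from collections import Counter

# Hypothetical submission texts, invented for illustration.
texts = [
    "Elon is the best",
    "Why did Elon buy Twitter",
    "The stock is down again and the board is silent",
]


def word_lengths(texts):
    """Number of whitespace-separated words in each text."""
    return [len(t.split()) for t in texts]


def most_common_words(texts, n=10):
    """Most frequent lowercased words, before any stop word removal."""
    words = (w.lower() for t in texts for w in t.split())
    return Counter(words).most_common(n)


def length_histogram(lengths, bin_width=100):
    """Count texts per length bin; the key is the lower edge of the bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


def chars_to_words(char_limit, chars_per_word):
    """Rough word capacity of a character limit (assumed average word size)."""
    return char_limit // chars_per_word


lengths = word_lengths(texts)
print(lengths, length_histogram(lengths))
print(most_common_words(texts, 2))
print(chars_to_words(10_000, 7), chars_to_words(10_000, 5))
```

As in the report, the most frequent words in the raw sample are stop words, which is why stop word removal comes before any keyword analysis.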
For text data cleaning, we use John Snow Labs’ Spark NLP with five main cleaning components: Tokenizer, Normalizer, LemmatizerModel, StopWordsCleaner and Stemmer. We remove English stopwords from our dataset because text data can contain many stopwords that interfere with the later steps, which look at more meaningful and polarized words. The pipeline is set up in the following order. We first tokenize the text, breaking sentences down into individual words. We then normalize the tokens to lowercase. Next, we use the lemmatizer and the stemmer to reduce the various forms of a word to its base, by recognizing root forms and the context in which the word appears. Note that the lemmatizer needs a dictionary, so a pretrained model is used. Lastly, the finisher converts the tokens into readable string output. Once the pipeline is designed, the text submissions are transformed by this first, text-cleaning pipeline. The second part computes the sentiment analysis. The major step in the second pipeline uses the pretrained Vivekn Sentiment Model to obtain the final sentiment of each text submission. The final sentiment is crucial, as it indicates whether the lemmatized string leans positive or negative. Counting positive and negative strings, and tracking how the counts change over time, shows Elon Musk’s influence and reputation.
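To make the order of the cleaning stages concrete, here is a toy, plain-Python stand-in for the pipeline. It does not use Spark NLP: every function, the tiny stopword list and the crude suffix stripping are hypothetical simplifications. The dictionary-based lemmatizer and the pretrained sentiment model are deliberately left out because they rely on pretrained resources.

```python
import re

# Tiny hypothetical stopword list; a real cleaner uses a full English list.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "his", "on"}


def tokenize(text):
    """Split a sentence into word tokens."""
    return re.findall(r"[A-Za-z']+", text)


def normalize(tokens):
    """Lowercase each token and drop anything that is not a letter."""
    return [re.sub(r"[^a-z]", "", token.lower()) for token in tokens]


def remove_stopwords(tokens):
    """Drop empty tokens and stopwords."""
    return [token for token in tokens if token and token not in STOPWORDS]


def stem(tokens):
    """Crude suffix stripping, standing in for a real stemmer."""
    stems = []
    for token in tokens:
        for suffix in ("ing", "ed", "s"):
            # Only strip when at least three characters remain.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems


def finish(tokens):
    """Join the tokens back into one readable string."""
    return " ".join(tokens)


def clean(text):
    """Run the stages in order: tokenize, normalize, stopwords, stem, finish."""
    return finish(stem(remove_stopwords(normalize(tokenize(text)))))


cleaned = clean("Elon is buying the Twitter shares!")
print(cleaned)
```

The cleaned string is what a sentiment model would then label as positive or negative.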
Fig. 5 - General Word Cloud with Elon Musk Silhouette
After completing the sentiment analysis, we are able to create word clouds from our dataset, which now has an additional sentiment column labelling each text submission as either positive or negative. The word cloud above is a general word cloud in the shape of Elon Musk’s silhouette. The most common words associated with Elon Musk are ‘twitter’, ‘make’, ‘tesla’, ‘think’, ‘say’, ‘thing’, and so on. The word clusters are quite similar in the figures below, which show positive sentiment on the left and negative sentiment on the right. The overlap is interesting, as it shows that the topics behind those words may be controversial, with Reddit users holding varied views on them.
Fig. 6a & 6b - Positive Word Cloud on Elon Musk, Negative Word Cloud on Elon Musk
In general, word clouds are a good way to help reach business goals 6-10, which mainly aim to explore controversy as well as the keywords and important elements that gain popularity on the internet. In this case, the word clouds show the areas that the public wants to discuss, learn about and know about Elon Musk.
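The observation that the same words dominate both clouds can be checked by intersecting the top words of each sentiment class. The sketch below uses invented cleaned texts and a hypothetical helper `top_words`; it shows only the counting logic behind a word cloud, not the plotting.

```python
from collections import Counter

# Hypothetical cleaned texts with sentiment labels, invented for illustration.
labelled = [
    ("twitter make good deal", "positive"),
    ("twitter make great car", "positive"),
    ("twitter make bad deal", "negative"),
    ("twitter make awful mess", "negative"),
]


def top_words(labelled, sentiment, n):
    """Return the n most frequent words among texts with the given sentiment."""
    counts = Counter(
        word
        for text, label in labelled
        if label == sentiment
        for word in text.split()
    )
    return [word for word, _ in counts.most_common(n)]


positive_top = top_words(labelled, "positive", 2)
negative_top = top_words(labelled, "negative", 2)

# Words that rank highly in both classes point to contested topics.
shared = set(positive_top) & set(negative_top)
print(sorted(shared))
```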
As most of the public knows, on April 4, 2022, Elon Musk announced that he had acquired 9.2 percent of Twitter’s shares, totaling $2.64 billion, making him Twitter’s largest shareholder. Musk stated that he planned to introduce new features to the platform, make its algorithms open source, combat spambot accounts, and promote free speech. This matches the month in which he received the most submissions, positive and negative combined, of any month.
Fig. 8 - Graph Illustrating TSLA Stock Price and No. of Submissions Related to Elon Musk from 2021/01/01 to 2022/08/31
After overlaying the Tesla stock price trend from Yahoo Finance on a histogram of the number of submissions related to Elon Musk from 2021/01/01 to 2022/08/31, we can see that the controversy coincides with the drop in Tesla’s stock price after the announcement that Musk had become Twitter’s largest shareholder.
Table 1 - Summary Table Illustrating Count of Text With or Without Elon Musk or Twitter and its Sentiment
Overall, the summary table breaks submissions down by whether they mention Elon Musk and whether they mention Twitter, each treated as a dummy variable, and shows the sentiment of each combination. Among submissions that involve Elon Musk, the largest group is negative submissions, whether or not they also mention Twitter.
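A summary table of this kind can be built by turning each text into two dummy flags and counting the combinations per sentiment. The sketch below is an illustration under stated assumptions: the texts are invented, and the matching rule (a case-insensitive search for “elon” or “musk”, and for “twitter”) is a guess at how the flags could be defined, not the project’s actual rule.

```python
from collections import Counter

# Hypothetical labelled texts, invented for illustration.
labelled = [
    ("Elon Musk buys Twitter", "negative"),
    ("Elon is a genius", "positive"),
    ("Twitter is down again", "negative"),
    ("Nice rocket launch", "positive"),
    ("Musk and Twitter again", "negative"),
]


def mention_flags(text):
    """Return (mentions Elon Musk, mentions Twitter) as two booleans."""
    lowered = text.lower()
    mentions_elon = "elon" in lowered or "musk" in lowered
    mentions_twitter = "twitter" in lowered
    return mentions_elon, mentions_twitter


def summary_table(labelled):
    """Count texts per (mentions Elon, mentions Twitter, sentiment) combination."""
    return Counter(
        mention_flags(text) + (sentiment,) for text, sentiment in labelled
    )


table = summary_table(labelled)
for (elon, twitter, sentiment), count in sorted(table.items()):
    print(f"elon={elon} twitter={twitter} {sentiment}: {count}")
```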